Individual exercise 1:

Compare “Results from JCI Insight” vs “Results derived from your own analysis”

This report compares the results of the RNA-seq analysis reported in Braun et al., 2022 (https://insight.jci.org/articles/view/154183) with the results I obtained by using and modifying the code provided by Prof. Gomez-Cabrero in class.

Part 1: Load and prepare the data

Getting the counts from GEO

urld <- "https://www.ncbi.nlm.nih.gov/geo/download/?format=file&type=rnaseq_counts"
path <- paste(urld, "acc=GSE198256", "file=GSE198256_raw_counts_GRCh38.p13_NCBI.tsv.gz", sep="&");
GSE198256_count <- as.matrix(data.table::fread(path, header=T, colClasses="integer"), rownames=1)

Getting the metadata

library(GEOquery)
gds <- getGEO("GSE198256")
## Found 1 file(s)
## GSE198256_series_matrix.txt.gz
Meta_GSE198256 <- pData(gds$GSE198256_series_matrix.txt.gz@phenoData)
Meta_GSE198256 <- Meta_GSE198256[,c("title","source_name_ch1","characteristics_ch1","characteristics_ch1.1","description","cell type:ch1","disease state:ch1")]

Factors_GSE198256 <- Meta_GSE198256[,c("disease state:ch1")]

The initial number of genes in the count matrix:

## [1] 39376

Next, we collect the biological information needed to later perform the filtering with NOISeq. For this, we first export the gene IDs and manually retrieve the relevant annotations from BioMart. This process could also be done with the biomaRt R package (I tried, since I have used it before, but I got an error, apparently because I was calling one of the attributes by an incorrect name; however, I did not find any discrepancy between my call and the attribute list, so I still need to figure out what happened).

# Write the gene ids to a file, for uploading on BioMart web.
write.table(rownames(GSE198256_count),"gene_names.entrez.txt",
            col.names = FALSE,row.names = FALSE,quote=F)
# Import the information
annotgene <- read.csv("./Resources/mart_export (10).txt",sep="\t",header = T)
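For reference, the programmatic route might look like the sketch below. The connection call and attribute names here are my assumptions and should be checked against listAttributes(mart); a mismatch there may well be the source of the error I ran into. The call requires internet access.

```r
library(biomaRt)

# Connect to the Ensembl human gene dataset (network-dependent)
mart <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")

# Attribute names below are assumptions; verify them with listAttributes(mart)
annot <- getBM(
  attributes = c("entrezgene_id", "chromosome_name",
                 "start_position", "end_position",
                 "percentage_gene_gc_content", "gene_biotype"),
  filters    = "entrezgene_id",
  values     = rownames(GSE198256_count),
  mart       = mart
)
```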

Let’s check how many genes were annotated:

sum(rownames(GSE198256_count) %in% annotgene$Entrezgene)
## [1] 25898

And the original number of genes in our count matrix was

nrow(GSE198256_count)
## [1] 39376

As we can see, we obtain annotations for fewer genes than are contained in the original table. Since we use NOISeq only for quality control, it is fine to limit ourselves to the annotated genes here; for later analyses we will use all the genes. Meanwhile, let’s filter the genes.

# Filter the genes to keep only those annotated to canonical chromosomes
annotgene <- annotgene[annotgene$Chromosome %in% c(as.character(1:22) ,"X","Y"),]

The number of genes is reduced to:

sum(rownames(GSE198256_count) %in% annotgene$Entrezgene) #25872
## [1] 25872

Some genes can also have more than one annotation, so let’s remove the duplicates:

annotgene_filt <- annotgene[!duplicated(annotgene$Entrezgene),]
sum(rownames(GSE198256_count) %in% annotgene$Entrezgene)
## [1] 25872
sum(annotgene_filt$Entrezgene %in% rownames(GSE198256_count))
## [1] 25872

Then, after removing duplicates, we can assign the gene names as the row names of our object:

## Overlap between annotation and genes
rownames(annotgene_filt) <- as.character(annotgene_filt$Entrezgene)
sum(as.character(rownames(annotgene_filt)) %in% rownames(GSE198256_count))
## [1] 25872

Now, we can work with the annotated genes

GSE198256_count_filt <- GSE198256_count[rownames(GSE198256_count) %in% rownames(annotgene_filt),]
GSE198256_count_exc <-GSE198256_count[!(rownames(GSE198256_count) %in% rownames(annotgene_filt)),]
annotgene_ord <- annotgene_filt[rownames(GSE198256_count_filt),]
# Check that the row order and dimensions match
sum(rownames(annotgene_ord)==rownames(GSE198256_count_filt))
## [1] 25872

Part 2: Exploratory analysis and prefiltering with NOISeq

NOISeq is an R package for quality control of count data; it allows us to identify and control outliers and to detect whether our data has any bias. With NOISeq we can monitor the quality of our data, make better decisions to improve the analysis, and be aware of features of the data that could bias the interpretation of the results. For quality control and exploration, NOISeq requires as input the counts, a factor (the variable of interest in our analysis/comparison), and some biological information, such as gene length, chromosome location, gene biotype, and GC content.

library(NOISeq)

We are going to use the annotations collected from BioMart to extract the additional biological information required by NOISeq; all the information is in annotgene_ord:

##              GC start   end Chromosome   type Entrezgene
## 100287102 57.50 11869 14409          1 lncRNA  100287102
## 102466751 61.76 17369 17436          1  miRNA  102466751
## 100302278 31.16 30366 30503          1  miRNA  100302278

Now, let’s extract the biological data from the previous object in the format required by NOISeq: gene ID and value.

#Adding additional biological data 
lengthuse <- abs(annotgene_ord$end-annotgene_ord$start)
names(lengthuse) <- rownames(annotgene_ord)
gc <- annotgene_ord$GC
names(gc) <- rownames(annotgene_ord)
biotype <- annotgene_ord$type
names(biotype) <- rownames(annotgene_ord)
chromosome <- annotgene_ord[,c("Chromosome","start","end")]

Now, let’s define the factors:

Factors_GSE198256 <- data.frame(Meta_GSE198256[colnames(GSE198256_count_filt), c("disease state:ch1")])
colnames(Factors_GSE198256)[1] <- "Group"

Now that we have all the necessary information, let’s create the NOISeq object.

data_NOISEQ <- readData(data = GSE198256_count_filt,
                        length=lengthuse,
                        gc=gc,
                        biotype= biotype,
                        chromosome = annotgene_ord[,c("Chromosome","start","end")],
                        factors = Factors_GSE198256)

Next, we can generate some exploratory plots to evaluate the quality of our data (I am not generating all the available plots, just the most representative ones).

Biotype detection: plots the abundance of the different biotypes (e.g., protein coding, lncRNA) in the genome, together with the percentage of genes detected in the sample/condition and within the sample/condition.

## Biotypes detection is to be computed for:
##  [1] "GSM5942339" "GSM5942340" "GSM5942341" "GSM5942342" "GSM5942343"
##  [6] "GSM5942344" "GSM5942345" "GSM5942346" "GSM5942347" "GSM5942348"
## [11] "GSM5942349" "GSM5942350" "GSM5942351" "GSM5942352" "GSM5942353"
## [16] "GSM5942354" "GSM5942355" "GSM5942356" "GSM5942357" "GSM5942358"
## [21] "GSM5942359" "GSM5942360" "GSM5942361" "GSM5942362" "GSM5942363"
## [26] "GSM5942364" "GSM5942365" "GSM5942366" "GSM5942367" "GSM5942368"
## [31] "GSM5942369" "GSM5942370" "GSM5942371" "GSM5942372"

Saturation: represents the number of detected genes (counts > 0) per sample across different sequencing depths. It is also possible to generate the saturation plot for specific biotypes, as in the second example below; in that case it represents the number of detected genes of that biotype across sequencing depths.

Length bias: mean gene expression per gene-length bin. A fitted curve and a diagnostic test are also produced.

## [1] "Warning: 4402 features with 0 counts in all samples are to be removed for this analysis."
## [1] "Length bias detection information is to be computed for:"
## [1] "Covid19: Acute infection" "Covid19: Recovery 3Mo"   
## [3] "Covid19: Recovery 6Mo"    "Healthy"                 
## [1] "Covid19: Acute infection"
## 
## Call:
## lm(formula = datos[, i] ~ bx)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.911  -4.683  -0.612   2.443  76.882 
## 
## Coefficients: (3 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    6.856     10.112   0.678  0.49938   
## bx1            5.307     10.404   0.510  0.61112   
## bx2           18.797     11.096   1.694  0.09352 . 
## bx3           31.606     12.199   2.591  0.01108 * 
## bx4           34.033     12.907   2.637  0.00978 **
## bx5           44.134     14.821   2.978  0.00368 **
## bx6           26.317     18.781   1.401  0.16440   
## bx7           24.610     28.088   0.876  0.38315   
## bx8           36.512     54.607   0.669  0.50535   
## bx9          -21.023    111.549  -0.188  0.85092   
## bx10          39.474     95.600   0.413  0.68061   
## bx11          14.260     16.692   0.854  0.39508   
## bx12              NA         NA      NA       NA   
## bx13              NA         NA      NA       NA   
## bx14              NA         NA      NA       NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.11 on 95 degrees of freedom
## Multiple R-squared:  0.5433, Adjusted R-squared:  0.4904 
## F-statistic: 10.28 on 11 and 95 DF,  p-value: 4.066e-12
## 
## [1] "Covid19: Recovery 3Mo"
## 
## Call:
## lm(formula = datos[, i] ~ bx)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.592  -4.774  -0.605   2.684 119.975 
## 
## Coefficients: (3 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    7.926     13.907   0.570   0.5701  
## bx1            6.998     14.309   0.489   0.6259  
## bx2           11.385     15.261   0.746   0.4575  
## bx3           26.902     16.778   1.603   0.1122  
## bx4           26.663     17.752   1.502   0.1364  
## bx5           43.826     20.385   2.150   0.0341 *
## bx6           19.515     25.831   0.755   0.4518  
## bx7           27.683     38.632   0.717   0.4754  
## bx8           23.538     75.105   0.313   0.7547  
## bx9            2.597    153.423   0.017   0.9865  
## bx10          24.305    131.487   0.185   0.8537  
## bx11          18.537     22.957   0.807   0.4214  
## bx12              NA         NA      NA       NA  
## bx13              NA         NA      NA       NA  
## bx14              NA         NA      NA       NA  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.91 on 95 degrees of freedom
## Multiple R-squared:  0.2976, Adjusted R-squared:  0.2163 
## F-statistic:  3.66 on 11 and 95 DF,  p-value: 0.0002395
## 
## [1] "Covid19: Recovery 6Mo"
## 
## Call:
## lm(formula = datos[, i] ~ bx)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.922  -4.884  -0.487   2.432  73.086 
## 
## Coefficients: (3 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    8.139     10.008   0.813  0.41807   
## bx1            5.125     10.296   0.498  0.61981   
## bx2           18.789     10.981   1.711  0.09034 . 
## bx3           25.179     12.073   2.085  0.03971 * 
## bx4           30.032     12.774   2.351  0.02079 * 
## bx5           40.498     14.668   2.761  0.00692 **
## bx6           18.847     18.588   1.014  0.31318   
## bx7           25.719     27.799   0.925  0.35722   
## bx8           27.586     54.044   0.510  0.61093   
## bx9           -6.744    110.400  -0.061  0.95142   
## bx10          29.213     94.615   0.309  0.75819   
## bx11          15.187     16.520   0.919  0.36025   
## bx12              NA         NA      NA       NA   
## bx13              NA         NA      NA       NA   
## bx14              NA         NA      NA       NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.01 on 95 degrees of freedom
## Multiple R-squared:  0.4577, Adjusted R-squared:  0.395 
## F-statistic:  7.29 on 11 and 95 DF,  p-value: 6.795e-09
## 
## [1] "Healthy"
## 
## Call:
## lm(formula = datos[, i] ~ bx)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.506  -5.049  -0.397   2.055 122.350 
## 
## Coefficients: (3 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    8.668     14.151   0.613   0.5417  
## bx1            6.170     14.559   0.424   0.6727  
## bx2           11.731     15.528   0.755   0.4518  
## bx3           29.188     17.072   1.710   0.0906 .
## bx4           28.229     18.063   1.563   0.1214  
## bx5           47.408     20.742   2.286   0.0245 *
## bx6           17.702     26.284   0.673   0.5023  
## bx7           28.622     39.308   0.728   0.4683  
## bx8           30.383     76.420   0.398   0.6918  
## bx9          -11.080    156.109  -0.071   0.9436  
## bx10          34.762    133.789   0.260   0.7956  
## bx11          17.101     23.359   0.732   0.4659  
## bx12              NA         NA      NA       NA  
## bx13              NA         NA      NA       NA  
## bx14              NA         NA      NA       NA  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.15 on 95 degrees of freedom
## Multiple R-squared:  0.3359, Adjusted R-squared:  0.259 
## F-statistic: 4.368 on 11 and 95 DF,  p-value: 2.773e-05

GC bias: mean gene expression per GC-content bin. A fitted curve and a diagnostic test are also produced.

## [1] "Warning: 4402 features with 0 counts in all samples are to be removed for this analysis."
## [1] "GC content bias detection is to be computed for:"
## [1] "Covid19: Acute infection" "Covid19: Recovery 3Mo"   
## [3] "Covid19: Recovery 6Mo"    "Healthy"                 
## [1] "Covid19: Acute infection"
## 
## Call:
## lm(formula = datos[, i] ~ bx)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9344 -2.5596 -0.4698  1.9421 19.6481 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.904      5.181   0.174  0.86186    
## bx1            3.783      7.327   0.516  0.60691    
## bx2         3046.020   2687.824   1.133  0.26002    
## bx3          -69.232     35.113  -1.972  0.05162 .  
## bx4           53.828      8.491   6.340 8.19e-09 ***
## bx5           31.939      6.215   5.139 1.52e-06 ***
## bx6           34.478      5.953   5.792 9.41e-08 ***
## bx7           18.588      6.018   3.089  0.00265 ** 
## bx8           19.065      6.101   3.125  0.00237 ** 
## bx9           15.034      6.378   2.357  0.02051 *  
## bx10           8.934      7.150   1.250  0.21461    
## bx11          15.093      9.553   1.580  0.11751    
## bx12         -11.281     32.776  -0.344  0.73148    
## bx13          27.900    211.296   0.132  0.89524    
## bx14              NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.181 on 93 degrees of freedom
## Multiple R-squared:  0.8153, Adjusted R-squared:  0.7895 
## F-statistic: 31.58 on 13 and 93 DF,  p-value: < 2.2e-16
## 
## [1] "Covid19: Recovery 3Mo"
## 
## Call:
## lm(formula = datos[, i] ~ bx)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9195 -2.9800 -0.9105  2.2264 17.4665 
## 
## Coefficients: (1 not defined because of singularities)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.8506     5.4641   0.156 0.876624    
## bx1            2.9233     7.7274   0.378 0.706065    
## bx2         2834.6358  2834.7400   1.000 0.319923    
## bx3          -80.6933    37.0322  -2.179 0.031858 *  
## bx4           49.5330     8.9550   5.531 2.90e-07 ***
## bx5           25.3094     6.5544   3.861 0.000208 ***
## bx6           31.3267     6.2787   4.989 2.81e-06 ***
## bx7           18.5210     6.3472   2.918 0.004417 ** 
## bx8           18.1961     6.4341   2.828 0.005735 ** 
## bx9           16.1737     6.7266   2.404 0.018179 *  
## bx10          10.1207     7.5409   1.342 0.182826    
## bx11          14.6263    10.0746   1.452 0.149923    
## bx12          -5.9216    34.5674  -0.171 0.864356    
## bx13          -0.7437   222.8450  -0.003 0.997344    
## bx14               NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.464 on 93 degrees of freedom
## Multiple R-squared:  0.7038, Adjusted R-squared:  0.6623 
## F-statistic: 16.99 on 13 and 93 DF,  p-value: < 2.2e-16
## 
## [1] "Covid19: Recovery 6Mo"
## 
## Call:
## lm(formula = datos[, i] ~ bx)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.6239 -2.9235 -0.7266  2.3116 22.8849 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.194      5.369   0.222 0.824496    
## bx1            2.940      7.593   0.387 0.699447    
## bx2         2280.998   2785.270   0.819 0.414908    
## bx3          -64.502     36.386  -1.773 0.079551 .  
## bx4           47.850      8.799   5.438 4.32e-07 ***
## bx5           25.740      6.440   3.997 0.000128 ***
## bx6           32.896      6.169   5.332 6.77e-07 ***
## bx7           19.862      6.236   3.185 0.001971 ** 
## bx8           20.614      6.322   3.261 0.001553 ** 
## bx9           17.647      6.609   2.670 0.008952 ** 
## bx10          12.582      7.409   1.698 0.092837 .  
## bx11          17.830      9.899   1.801 0.074915 .  
## bx12          -7.822     33.964  -0.230 0.818354    
## bx13          -3.508    218.956  -0.016 0.987251    
## bx14              NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.369 on 93 degrees of freedom
## Multiple R-squared:  0.6822, Adjusted R-squared:  0.6378 
## F-statistic: 15.36 on 13 and 93 DF,  p-value: < 2.2e-16
## 
## [1] "Healthy"
## 
## Call:
## lm(formula = datos[, i] ~ bx)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.078  -3.014  -0.553   2.167  21.200 
## 
## Coefficients: (1 not defined because of singularities)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.8403     5.2538   0.160  0.87328    
## bx1            3.6323     7.4300   0.489  0.62608    
## bx2         3043.7543  2725.6463   1.117  0.26700    
## bx3          -74.4242    35.6070  -2.090  0.03933 *  
## bx4           54.1499     8.6104   6.289 1.03e-08 ***
## bx5           28.4345     6.3022   4.512 1.88e-05 ***
## bx6           32.8538     6.0370   5.442 4.25e-07 ***
## bx7           18.7056     6.1029   3.065  0.00285 ** 
## bx8           18.0965     6.1865   2.925  0.00433 ** 
## bx9           15.1394     6.4677   2.341  0.02138 *  
## bx10           9.3411     7.2507   1.288  0.20084    
## bx11          14.8243     9.6869   1.530  0.12933    
## bx12          -8.8117    33.2371  -0.265  0.79151    
## bx13          12.1562   214.2689   0.057  0.95488    
## bx14               NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.254 on 93 degrees of freedom
## Multiple R-squared:  0.783,  Adjusted R-squared:  0.7527 
## F-statistic: 25.82 on 13 and 93 DF,  p-value: < 2.2e-16

Exploratory PCA: principal component analysis score plot for PC1 vs PC2.
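The exploratory plots in this part were generated with NOISeq’s dat()/explo.plot() pairs (the plotting code was not echoed above). A sketch of the calls, assuming the data_NOISEQ object created in Part 2; the exact arguments (samples, depths, trajectories to plot) may need adjusting:

```r
# Biotype detection, per sample
mybiodetection <- dat(data_NOISEQ, k = 0, type = "biodetection")
explo.plot(mybiodetection, samples = c(1, 2), plottype = "persample")

# Saturation, globally and for a specific biotype
mysaturation <- dat(data_NOISEQ, k = 0, ndepth = 7, type = "saturation")
explo.plot(mysaturation, toplot = 1, samples = 1:4)
explo.plot(mysaturation, toplot = "protein_coding", samples = 1:4)

# Length and GC-content bias, grouped by the factor of interest
mylengthbias <- dat(data_NOISEQ, factor = "Group", type = "lengthbias")
explo.plot(mylengthbias, samples = NULL, toplot = "global")

myGCbias <- dat(data_NOISEQ, factor = "Group", type = "GCbias")
explo.plot(myGCbias, samples = NULL, toplot = "global")

# PCA score plot, PC1 vs PC2
myPCA <- dat(data_NOISEQ, type = "PCA")
explo.plot(myPCA, factor = "Group")
```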

Part 3: Normalization

An essential step in an RNA-seq study is normalization, in which raw data are adjusted to account for factors that prevent direct comparison of expression measures. Errors in normalization can have a significant impact on downstream analyses, such as inflated false positives in differential expression analysis (Evans et al., 2018). There are multiple normalization methods; some of them are:

RPKM: a normalization by library size that aims to remove differences in sequencing depth by dividing by the total number of reads in each sample. The RPKM method adds a component to account for gene length as well; FPKM and ERPKM are variants of this method. After dividing by library size, the normalized counts reflect the proportion of total mRNA per cell taken up by each gene. If the total mRNA per cell is the same across conditions, this proportion reflects absolute mRNA per cell for each gene (Evans et al., 2018). In R we can apply it with NOISeq as:

myRPKM = rpkm(assayData(data_NOISEQ)$exprs, long = lengthuse, k = 0, lc = 1)
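To make the formula explicit, here is a minimal base-R sketch of the RPKM computation on a toy matrix. This is a conceptual illustration only, not NOISeq’s exact implementation (which handles length correction options and edge cases differently):

```r
# Toy counts: 3 genes x 2 samples
counts <- matrix(c(100, 200, 700,
                    50, 150, 800),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
gene_length <- c(g1 = 1000, g2 = 2000, g3 = 500)  # gene lengths in bp

# RPKM = counts / (library size in millions) / (gene length in kb)
lib_size <- colSums(counts)
rpkm_manual <- t(t(counts) / (lib_size / 1e6)) / (gene_length / 1000)
```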

UQUA: upper-quartile normalization is a normalization by distribution; it divides each read count by the 75th percentile of the read counts in its sample (Bullard et al., 2010). In R we can apply it with NOISeq as:

myUQUA = uqua(assayData(data_NOISEQ)$exprs, long = lengthuse, lc = 0.5, k = 0)
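Conceptually, the upper-quartile factor can be computed by hand. A simplified base-R sketch (NOISeq’s uqua adds length-correction and rescaling details not reproduced here):

```r
# Toy counts: 4 genes x 2 samples, one gene undetected everywhere
counts <- matrix(c(100, 200, 700, 0,
                    50, 150, 800, 0),
                 nrow = 4,
                 dimnames = list(paste0("g", 1:4), c("s1", "s2")))

# 75th percentile of the non-zero counts in each sample
uq <- apply(counts, 2, function(x) quantile(x[x > 0], probs = 0.75))

# Divide each sample by its upper quartile; rescale by the mean upper
# quartile to keep values on a count-like scale
uqua_manual <- t(t(counts) / uq) * mean(uq)
```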

TMM: the trimmed mean of M-values (TMM) is also a normalization by distribution. This approach chooses one sample as a reference, then calculates fold changes (M) and absolute expression levels (A) relative to that sample. Genes are trimmed on both of these values, to remove differentially expressed genes, and the trimmed mean of the remaining fold changes is computed for each sample. Read counts are then scaled by this trimmed mean and the total count of their sample. With NOISeq, TMM normalization can be applied as follows:

myTMM = tmm(assayData(data_NOISEQ)$exprs, long = 1000, lc = 0)
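The double-trimming idea can be illustrated with a simplified base-R sketch for one sample against a reference. The real method (edgeR) additionally weights genes by their estimated precision; the trim fractions below follow its defaults (30% per tail on M, 5% per tail on A):

```r
set.seed(1)
# Toy data: 100 genes, a reference sample and an observed sample
counts <- matrix(rpois(200, lambda = 50), ncol = 2,
                 dimnames = list(paste0("g", 1:100), c("ref", "obs")))

ref_p <- counts[, "ref"] / sum(counts[, "ref"])
obs_p <- counts[, "obs"] / sum(counts[, "obs"])
keep <- ref_p > 0 & obs_p > 0

M <- log2(obs_p[keep] / ref_p[keep])        # log fold change vs reference
A <- 0.5 * log2(obs_p[keep] * ref_p[keep])  # average (absolute) expression

# Double trimming: drop extreme M and extreme A values
trim <- M >= quantile(M, 0.30) & M <= quantile(M, 0.70) &
        A >= quantile(A, 0.05) & A <= quantile(A, 0.95)

# Scaling factor for "obs" relative to "ref"
tmm_factor <- 2^mean(M[trim])
```

With two samples drawn from the same distribution, the factor should land close to 1, since there is no systematic composition difference to correct.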

Part 4: Differential expression analysis with DESeq2 and comparison with the original paper (Braun et al., 2022, JCI Insight)

We start by creating the DESeq2 object for the analysis

library(DESeq2)
GSE198256_DESeq2 <- DESeqDataSetFromMatrix(countData = GSE198256_count_filt,
                              colData = pData(data_NOISEQ),
                              design = ~ Group)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]

Now we modify the labels to match the paper’s format and to avoid the warnings caused by unaccepted characters.

pDataUSE <- pData(data_NOISEQ)
pDataUSE[pDataUSE=="Healthy"] <- "Control"
pDataUSE[pDataUSE=="Covid19: Acute infection"] <- "Acute"
pDataUSE[pDataUSE=="Covid19: Recovery 3Mo"] <- "Early_recovery"
pDataUSE[pDataUSE=="Covid19: Recovery 6Mo"] <- "Late_recovery"
pDataUSE[,1] <- as.factor(pDataUSE[,1])

And let’s try again

GSE198256_DESeq2 <- DESeqDataSetFromMatrix(countData = GSE198256_count_filt,
                                           colData = pDataUSE,
                                           design = ~ Group)
##   it appears that the last variable in the design formula, 'Group',
##   has a factor level, 'Control', which is not the reference level. we recommend
##   to use factor(...,levels=...) or relevel() to set this as the reference level
##   before proceeding. for more information, please see the 'Note on factor levels'
##   in vignette('DESeq2').

In the paper, the authors mention that they performed the DE analysis by comparing the different groups of COVID-19 patients with the healthy controls. So let’s define the controls as our reference group for the contrasts using the relevel function.

GSE198256_DESeq2$Group <- relevel(GSE198256_DESeq2$Group, ref="Control")

With the current model we would include all genes from GSE198256_count_filt in the analysis, but we can remove the genes with low expression to avoid noise and bias in our DEGs. To do so, we use the filtering criterion suggested by Prof. Gomez-Cabrero: retain genes with counts >= 10 in at least 6 samples, 6 being the size of the smallest group in our contrasts (here, the early-recovery group, n = 6).

smallestGroupSize <- 6
keep <- rowSums(counts(GSE198256_DESeq2) >= 10) >= smallestGroupSize
GSE198256_DESeq2_F <- GSE198256_DESeq2[keep,]
nrow(GSE198256_DESeq2_F)
## [1] 14055

Comparison with the paper: this is our first difference from Braun et al., 2022, since their filtering criterion was to keep genes with counts >= 20 in at least one sample. That filter is more relaxed than ours, and using it we end up with more genes, as we can see below:

#Trying the filtering criteria used in the paper
keep_paper <- rowSums(counts(GSE198256_DESeq2) >= 20) >= 1
GSE198256_DESeq2_Paper <- GSE198256_DESeq2[keep_paper,]
nrow(GSE198256_DESeq2_Paper)
## [1] 15573

Now, let’s go back to our initial filter and perform the DEG analysis.

GSE198256_DESeq2_F <- DESeq(GSE198256_DESeq2_F)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 13 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
GSE198256_res <- results(GSE198256_DESeq2_F)
# Check that we have the correct contrasts, always with control as the reference
resultsNames(GSE198256_DESeq2_F)
## [1] "Intercept"                       "Group_Acute_vs_Control"         
## [3] "Group_Early_recovery_vs_Control" "Group_Late_recovery_vs_Control"

Now, let’s explore the number of DEGs obtained in each contrast.

Initial DEGs Acute vs Control

summary(results(GSE198256_DESeq2_F, contrast = c('Group', 'Acute', 'Control')))
## 
## out of 14055 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 1093, 7.8%
## LFC < 0 (down)     : 1442, 10%
## outliers [1]       : 7, 0.05%
## low counts [2]     : 0, 0%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Initial DEGs Early recovery vs Control

summary(results(GSE198256_DESeq2_F, contrast = c('Group', 'Early_recovery', 'Control')))
## 
## out of 14055 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 1184, 8.4%
## LFC < 0 (down)     : 1246, 8.9%
## outliers [1]       : 7, 0.05%
## low counts [2]     : 0, 0%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Initial DEGs Late recovery vs Control

summary(results(GSE198256_DESeq2_F, contrast = c('Group', 'Late_recovery', 'Control')))
## 
## out of 14055 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 40, 0.28%
## LFC < 0 (down)     : 98, 0.7%
## outliers [1]       : 7, 0.05%
## low counts [2]     : 11437, 81%
## (mean count < 780)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Now let’s explore the data a little with an MA plot.

plotMA(GSE198256_res, ylim=c(-2,2))

Interpretation: from the plot we can see that at low A (x-axis) the M values (y-axis) are more variable. In other words, genes with few counts (low A) show very variable log fold changes (M), so most of these are likely errors. As expected, the further we move to the right on the x-axis (more counts), the tighter the plot becomes: the log fold changes are less variable and the error is lower. To correct this problem, we shrink the fold changes. Shrinkage looks at the largest fold changes that are not due to low counts (the right side of the x-axis) and uses these to inform a prior distribution, so that large fold changes from genes with plenty of statistical information are not shrunk, while imprecise fold changes (from genes with low counts) are.

#now shrink the fold changes
res_acute <- lfcShrink(GSE198256_DESeq2_F, coef = c('Group_Acute_vs_Control'))
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
res_early <- lfcShrink(GSE198256_DESeq2_F, coef = c('Group_Early_recovery_vs_Control'))
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
res_late <- lfcShrink(GSE198256_DESeq2_F, coef = c('Group_Late_recovery_vs_Control')) 
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895

Let’s draw the MA plot again to look at the effect of shrinkage.

Acute vs Control

Early recovery vs Control

Late recovery vs Control

Now, let’s filter our DEGs using the same criteria as the original authors: FDR < 0.05 and |FC| > 2. Note that a FC of 2 corresponds to log2FC = 1, since log2(2) = 1.

deg_acute = as.data.frame(res_acute)[which(res_acute$padj < 0.05 & abs(res_acute$log2FoldChange)>1),]
nrow(deg_acute) # 803 degs
## [1] 803
deg_early = as.data.frame(res_early)[which(res_early$padj < 0.05 & abs(res_early$log2FoldChange)>1),] 
nrow(deg_early)# 574 degs
## [1] 574
deg_late = as.data.frame(res_late)[which(res_late$padj < 0.05 & abs(res_late$log2FoldChange)>1),] 
nrow(deg_late)# 16 degs
## [1] 16

Comparison with the paper As expected, the number of DEGs we obtain is not the same as the number reported in the paper, since we used different filtering criteria on the counts matrix and we applied shrinkage (the paper does not specify whether the authors did). We got more DEGs (n = 803) in the Acute vs Control contrast, while the authors reported 339 DEGs for the same comparison. In the second contrast, Early recovery vs Control, we have a slightly higher number of DEGs (n = 574) than reported in the paper (n = 521). For the last contrast, Late recovery vs Control, we got only 16 DEGs. In concordance, the paper’s authors mention they found “hardly any DEG”, although they do not specify the number.
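Beyond counting DEGs per contrast, the lists can be intersected to see how the contrasts relate. A minimal base-R sketch on toy ID vectors (with the real objects one would use rownames(deg_acute), rownames(deg_early), etc.):

```r
# Toy gene-ID vectors standing in for rownames(deg_acute) / rownames(deg_early)
acute_ids <- c("g1", "g2", "g3", "g4")
early_ids <- c("g3", "g4", "g5")

shared     <- intersect(acute_ids, early_ids)  # DEGs found in both contrasts
acute_only <- setdiff(acute_ids, early_ids)    # acute-specific DEGs
length(shared)      # number of shared DEGs
length(acute_only)  # number of acute-only DEGs
```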

Part 5: Visualization and Comparison with the Original Paper from Braun et al 2022, JCI Insight

I tried to recreate Figure 6 from the paper, since it is the main figure describing the transcriptomic profile of the monocytes during COVID-19 recovery.

## Loading required package: grid
## ========================================
## ComplexHeatmap version 2.16.0
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
## 
## If you use it in published research, please cite either one:
## - Gu, Z. Complex Heatmap Visualization. iMeta 2022.
## - Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
##     genomic data. Bioinformatics 2016.
## 
## 
## The new InteractiveComplexHeatmap package can directly export static 
## complex heatmaps into an interactive Shiny app with zero effort. Have a try!
## 
## This message can be suppressed by:
##   suppressPackageStartupMessages(library(ComplexHeatmap))
## ========================================
## ========================================
## circlize version 0.4.15
## CRAN page: https://cran.r-project.org/package=circlize
## Github page: https://github.com/jokergoo/circlize
## Documentation: https://jokergoo.github.io/circlize_book/book/
## 
## If you use it in published research, please cite:
## Gu, Z. circlize implements and enhances circular visualization
##   in R. Bioinformatics 2014.
## 
## This message can be suppressed by:
##   suppressPackageStartupMessages(library(circlize))
## ========================================

First, I recreate the PCA.

#Transform the data for visualization
vsd <- vst(GSE198256_DESeq2_F, blind=FALSE)

pca <- plotPCA(vsd, intgroup=c("Group"), returnData = TRUE)
#plotPCA's returnData already contains PC1, PC2 and the intgroup column,
#so we map those directly instead of reaching into external objects
ggplot(pca, 
       aes(x = PC1, 
           y = PC2, 
           color = Group)) +
  geom_point() +
  labs(x = "PC1: 41% variance", y = "PC2: 17% variance") +
  stat_ellipse()
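The hard-coded axis percentages can instead be derived from the data: plotPCA stores the fractions of variance in a "percentVar" attribute of its returnData output (attr(pca, "percentVar")), and the same quantity can be computed from the PC standard deviations. A self-contained toy sketch with prcomp:

```r
set.seed(42)
m <- matrix(rnorm(200), nrow = 20)   # toy matrix: 20 samples x 10 "genes"
pc <- prcomp(m)
# fraction of total variance explained by each PC, as a percentage
percentVar <- round(100 * pc$sdev^2 / sum(pc$sdev^2))
# e.g. paste0("PC1: ", percentVar[1], "% variance") for the axis label
```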

Comparison with the paper The distribution of the samples and the clusters in my PCA follows the same pattern as the PCA from the paper; there is only a slight difference in the percentage of variance explained by PC1 and PC2.

Then I recreate the heatmap

#Subset the expression table to the DEGs of each contrast
degslist <- c(rownames(deg_acute), rownames(deg_early), rownames(deg_late)) #1339 genes
#Keep only unique IDs to avoid plotting the same gene multiple times
degslist <- unique(degslist) #1314 genes
#Produce the z-score
avsd <- assay(vsd)
Z <- t(scale(t(avsd)))
#Extract my genes of interest - DEGs
Z_degs <- Z[rownames(Z) %in% degslist,]
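The t(scale(t(...))) idiom above standardizes each gene (row) to mean 0 and standard deviation 1 across samples, which is what the heatmap colors encode. A quick self-contained check on a toy matrix:

```r
m <- matrix(1:12, nrow = 3)   # toy: 3 genes x 4 samples
z <- t(scale(t(m)))           # row-wise z-score
rowMeans(z)                   # ~0 for every gene
apply(z, 1, sd)               # 1 for every gene
```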

ha = HeatmapAnnotation(
  df = data.frame(Group = pDataUSE$Group), #df expects a data.frame, not a bare vector
  annotation_height = unit(4, "mm")
)

#Set a seed for the k-means clustering, for reproducibility
set.seed(123)
hm <- draw(Heatmap(Z_degs, name = "z-score", km = 5, top_annotation = ha,
        show_row_names = FALSE, show_column_names = FALSE, cluster_columns = FALSE))

Interpretation and Comparison with the paper I used the k-means method to produce 5 clusters, as in the paper. The number of elements in my clusters differs from those proposed in the paper, primarily because the DEGs are different. However, I obtained a similar pattern of expression in my clusters. To corroborate this, it is necessary to verify whether the representative genes of each cluster are also present in my heatmap and whether they belong to clusters with a similar pattern of expression (color). For this, I extracted the IDs of the genes in each cluster using code available at https://github.com/jokergoo/ComplexHeatmap/issues/136, published by the user guidohooiveld.

rcl.list <- row_order(hm)  #Extract clusters (output is a list)
# loop to extract the genes of each cluster.
for (i in 1:length(row_order(hm))){
  if (i == 1) {
    clu <- t(t(row.names(Z_degs[row_order(hm)[[i]],])))
    out <- cbind(clu, paste("cluster", i, sep=""))
    colnames(out) <- c("GeneID", "Cluster")
  } else {
    clu <- t(t(row.names(Z_degs[row_order(hm)[[i]],])))
    clu <- cbind(clu, paste("cluster", i, sep=""))
    out <- rbind(out, clu)
  }
}
#check 
head(out,n=3)
##      GeneID Cluster   
## [1,] "2867" "cluster1"
## [2,] "4923" "cluster1"
## [3,] "2184" "cluster1"
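The same GeneID/Cluster table can be built without the if/else branch. A compact base-R sketch on a toy row-order list (in the real code, ord would be row_order(hm) and ids would be rownames(Z_degs)):

```r
# Toy stand-ins: 6 gene IDs split into two clusters by row index
ids <- c("2867", "4923", "2184", "19", "133", "135")
ord <- list(c(1, 2, 3), c(4, 5, 6))   # mimics the list returned by row_order(hm)

# One cbind per cluster, stacked with do.call(rbind, ...)
out <- do.call(rbind, lapply(seq_along(ord), function(i)
  cbind(GeneID = ids[ord[[i]]], Cluster = paste0("cluster", i))))
head(out, n = 3)
```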

Then I exported the list of IDs to a text file to annotate the gene names in BioMart and compare them with the clusters’ representative genes reported in the paper.

##export to later look for the gene name in bioMart and compare with the 
#representative genes per cluster from the paper.
#write.table(out, file= "gene_clusters.txt", sep="\t", quote=F, row.names=FALSE)
#dim(out)
out<-as.data.frame(out)
#Load the annotations obtained from biomart
annotclust <- read.csv("gene_names_clusters.txt",sep="\t",header = T)
head(annotclust, n=3)
##      GeneID Gene.name
## 1 100124536  SNORA38B
## 2 100128059          
## 3 100128164    FHL1P1
#dim(annotclust)
# [1] 1322    2 
#bigger than expected, so we remove any duplicated data 
annotclust <- annotclust[!duplicated(annotclust$GeneID), ] 
sum(annotclust$GeneID %in% out$GeneID) #1314 ok
## [1] 1314
#Then merge the gene list of our clusters with the BioMart annotations to pair gene ID and gene name
genes_cluster=merge(annotclust, out, by="GeneID", all.x=TRUE)
head(genes_cluster, n=3)
##   GeneID Gene.name  Cluster
## 1     19     ABCA1 cluster4
## 2    133       ADM cluster1
## 3    135   ADORA2A cluster4

Now that we have the gene names for our genes of interest, let’s create a vector containing the names of the representative genes reported in Figure 6.B from Braun et al., 2022 (84 genes in total). The next step is to subset genes_cluster to extract the genes annotated as representative in the paper.

rep_genes_paper <- c("DHCR24", "ACSL1", "ALOX15B", "HPGD", "MSMO", "LDHA1", "SCDDHFR", "MPO", "S100A12",
               "S100A8", "CLU", "HMGB2", "VSIG4", "FCGR1A", "LAIR1", "CD163", "IFI27", "IL1R2",
               "FLT3", "SQLE", "SPRY2", "PPARG", "CCL2", "CCL4", "CCL7", "CXCL1", "CXCL16", "CXCL2",
               "TNFRSF12A", "CSF1", "IL1R1", "TNFSF14", "TNFSF8", "DDIT3", "EIF2AK3", "FOXO3", "HMGA1",
               "JARID2", "IRF2BP2", "IRAK2", "IL1RN", "DOT1", "SOCS6L", "HAVCR2", "MAFF", "MAFB", "IL1B",
               "GPR183", "CIITA", "C3", "FTH1", "KLF12", "HLA-DMA", "CLEC10A", "CSF1R", "HLA-DMB", 
               "MAP3K14", "GPR183", "HLA-DQA1", "CD4", "VEGFA", "HLA-DQA2", "OSMa", "PTGER4", "HLA-DPA1",
               "CLEC4A", "CSF1R", "HLA-DOA", "SRCCAMK2D", "KLF11", "HLA-DRA", "NRP1", "PDGFC", "HLA-DRB1",
               "NR4A1", "NFE2L3", "ATF3", "JUNB", "NFKB2", "PLAG1", "METTL17", "APAF1", "HIST1H3C", "ZNF490")

#look for the representative genes in my clusters. 
my_rep_genes <- subset(genes_cluster, genes_cluster$Gene.name %in% rep_genes_paper)
head(my_rep_genes, n =3)
##    GeneID Gene.name  Cluster
## 12    247   ALOX15B cluster1
## 23    467      ATF3 cluster4
## 41    718        C3 cluster4
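The recovered representative genes can also be tabulated per cluster with table(); on the real data this would be table(my_rep_genes$Cluster). A toy sketch with hypothetical assignments:

```r
# Toy cluster assignments standing in for my_rep_genes$Cluster
clusters <- c("cluster1", "cluster4", "cluster4",
              "cluster5", "cluster5", "cluster5")
tab <- table(clusters)
tab       # counts of representative genes per cluster
sum(tab)  # total number of recovered representative genes
```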

In our clusters we identified 59 out of the 84 representative genes reported in the paper, and our genes partially share the clustering pattern. For example, in the paper the “HLA” family of genes all cluster together, which is in agreement with our observation that the HLA family also falls into a single cluster. Similarly, cluster II from Braun et al. and cluster 5 from my analysis show a similar pattern of expression and contain multiple genes in common. For more details, refer to the comparative figure below.